Haplotype-based variant detection from short-read sequencing 
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Abstract 

The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing 
smaU-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of 
modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our 
implementation of this framework in a haplotype-based variant detector, FreeBayes. 

1 Motivation 

While statistical phasing approaches are n e cessary for the deterrn i nation of large-scale hap- 



lotype s t ructu re [Browning and Browningl . 120071 . iDelaneau et al.l . 120 12L iHowie et al.l . 12011 



Li et al.l . |2010| | , sequencing traces provide short-range phasing information that may be em- 
ployed directly in primary variant detection to establish phase between proximal alleles. 
Present read lengths and error rates limit this physical phasing approach to variants clus- 
tered within tens of bases, but as the cost of obtaining long sequencing traces decreases 



Branton et al.l . |2008| . IClarke et al.l . |2009| . physical phasing methods will enable the determi- 



nation of larger haplotype structure directly using only sequence information from a single 
sample. 

Haplotype-based variant detection methods, in which short haplotypes are read directly 
from sequencing traces, offer a number of benefits over methods which operate on a single 
position at a time. Haplotype-based methods ensure semantic consistency among described 
variants by simultaneously evaluating all classes of alleles in the same context. The use of 
locally-phased genotype data can lower the computational burden of genotype imputation 
by reducing the possible space of haplotypes which must be considered. Locally phased 
genotypes can be used to improve genotyping accuracy in the context of rare variations that 
can be difficult to impute due to sparse linkage information. Similarly, they can assist in the 
design of genotyping assays, which can fail in the context of undescribed variation at the 
assayed locus. Provided sequencing errors are independent, the use of longer haplotypes in 
variant detection can improve detection by increasing the signal to noise ratio of the genotype 
likelihood space that is used in analysis. This follows from the fact that the space of possible 
erroneous haplotypes expands dramatically with haplotype length, while the space of true 
variation remains constant, with the number of true alleles less than or equal to the ploidy 
of the sample at a given locus. 

The direct detection of haplotypes from alignment data presents several challenges to 
existing variant detection methods. As the length of a haplotype increases, so does the num- 
ber of possible alleles within the haplotype, and thus methods designed to detect genetic 
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variation over haplotypes in a unified context must be able to model multiallelism. However, 
most variant detection methods establish estimates of the lik elihood of polymorphism at a 



given loci using statistical models which assume biallelism [Lil. l201ll iMarth et al.l . Il999j and 



uniform, typically diploid, copy number DePristo et al.l . 1201 1| . Moreover, improper model 



ing of copy number impedes the accurate detection of small variants on sex chromosomes, 
in polyploid organisms, or in locations with known copy-number variations, where called 
alleles, genotypes, and likelihoods should reflect local copy number and global ploidy. 

To enable the application of population-level inference methods t o the detection of h ap- 
lotypes, we generalize the Bayesian statistical method described by Marth et al. jl999l | to 
allow multiallelic loci and non-uniform copy nur nber across the s amples under consideration. 
We have implemented this model in FreeBayes jCarrison . 



2 Generalizing variant detection to multiallelic loci and non-uniform 
copy number 

2.1 Definitions 

At a given genetic locus we have n samples drawn from a population, each of which has 
a copy number or multiplicity of m within the locus. We denote the number of copies of 
the locus present within our set of samples as M = 'Y^^=i^i- Among these M copies we 
have K distinct alleles, bi, . . . ,bx with allele frequencies fi, . . . , fx- Each individual has an 
unphased genotype Gi comprised of ki distinct alleles , . . . , bk^ and corresponding genotype 
allele frequencies fi-^, . . . , fij^ , which may be equivalently expressed as a multiset of alleles 
Bi : \Bi\ = rrii. For the purposes of our analysis, we assume that we cannot accurately discern 
phasing information outside of the haplotype detection window, so our Gi are unordered and 
all Gi containing equivalent alleles and frequencies are regarded as equivalent. Assume a set 
of Si sequencing observations r^^, . . . , r^^ = Ri over each sample in our set of n samples such 
that there are ^"^j^ Si reads at the genetic locus under analysis. Qi denotes the mapping 
quality, or probability that the read is mis-mapped against the reference. 



2.2 A Bayesian approach 

To genotype the samples at a specific locus, we could simply apply a Bayesian statistic 
relating P{Gi\Ri) to the likelihood of sequencing errors in our reads and the prior likelihood 
of specific genotypes. However, this maximum-likelihood approach limits our ability to 
incorporate information from other individuals in the population under analysis, which can 
improve detection power. 

Given a set of genotypes Gi, . . . , Gn and observations observations Ri, . . . , Rn for all 
individuals at the current genetic locus, we can use Bayes' theorem to related the probability 
of a specific combination of genotypes to both the quality of sequencing observations and a 
priori expectations about the distribution of alleles within a set of individuals sampled from 
the same population: 
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P{Gi, . . . , Gn\Rl, . . . , Rn) 



P{Gi, . . . , Gn)P{Rl, • • • , Rn\Gi, . . . , Gn) 
P{Rl, . . . , Rn) 



p(r< r IP p ^ _ PjGi, ■ ■ ■ , Cn) n"=i -P(-R»l'^») /r,N 
• • • ' • • • ' ^'^^ - Eva„...,a„(^(Gi, . . . , G„) n^i ^(^.IG.)) 

Under this decomposition, P{Ri, . . . , Rn\Gi, . . . , G^) = nr=i f{Ri\Gi) represents the hke- 
hhood that our observations match a given genotype combination (our data hkehhood), and 
P{Gi, . . . , Gn) represents the prior hkehhood of observing a specific genotype combination. 
We estimate the data hkehhood as the joint probabihty that the observations for a specific 
individual support a given genotype. We use a neutral model of allele diffusion conditioned 
on an estimated population mutation rate to estimate the prior probability of sampling a 
given collection of genotypes. 

Except for situations with small numbers of samples and alleles, we avoid the explicit 
evaluation of the posterior distribution as implied by ([2]), instead using a number of op- 
timizations to make the algorithm tractable to apply to very large datasets (see section 



2.3 Estimating the probability of a sample genotype given sequencing observa- 
tions, P{Ri\Gi) 

Given a set of reads Ri = ri-^, . . . ,ri^_ of a sample at a given locus, we can extract a set of ki 
observed alleles B'- = b[, . . . , h'^, which encapsulate the potential set of represented variants 
at the locus in the given sample, including erroneous observations. Each of these observed 
alleles 6J has a frequency within the observations of the individual sample : ^jli o'^ = Si 
and corresponds to a true allele bi. 

If we had perfect observations of a locus, P{Ri\Gi) for any individual would approximate 
the probability of sampling observations Ri out of Gj with replacement. This probability 
is given by the multinomial distribution in Si over the probability P{bi) of obtaining each 

fi 

allele from the given genotype, which is for each allele frequency in the frequencies which 
define Gi, 

P,«dG.)«miG.,.^n(!;)°' (3) 

Our observations are not perfect, and thus we must account for the probability of errors in 
the reads. We can use the per-base quality scores provided by sequencing systems to develop 
the probability that an observed allele is drawn from an underlying true allele, P{B'^\Ri). 
We assume a mapping between sequencing quality scores and allele qualities such that each 
observed allele 6J has a corresponding quality qi which approximates the probability that the 
observed allele is incorrect. 

Furthermore, we must sum P{Ri\Gi) for all possible Ri combinations V(-Rj G Gi : 
\Ri\ = ki) drawn from our genotype to obtain the joint probability of Ri given Gi, as 
each YliLi Pim^i) oiily accounts for the marginal probability of the a specific Ri given B'-. 
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This extends P{Ri\Gi) as follows: 




1=1 

In summary, the probability of obtaining a set of reads given an underlying genotype 
is proportional to the probability of sampling the set of observations from the underlying 
genotype, scaled by the probability that our reads are correct. As qi approximates the 
probability that a specific bi is incorrect, P{b'i\bi) = 1 — qi when bi G Gi and P{b'i\bi) = qi 
when bi ^ Gi. 



2.4 Priors for unphased genotype combinations, P{Gi, . . . , G„) 
2.4.1 Decomposition of prior probability of genotype combination 

Let Gi, . . . ,Gn denote the set of genotypes at the locus and fi,...,fk denote the set of 
allele frequencies which corresponds to these genotypes. We estimate the prior likelihood of 
observing a specific combination of genotypes within a given locus by decomposition into 
resolvable terms: 



P(Gi, ...,Gn) = P(Gi, . . . , G„ n A, . . . , /fc) (5) 

The probability of a given genotype combination is equivalent to the intersection of that 
probability and the probability of the corresponding set of allele frequencies. This identity 
follows from the fact that the allele frequencies are derived from the set of genotypes and we 
always will have the same /i, . . . , for any equivalent Gi, . . . ,Gn- 

Following Bayes' Rule, this identity further decomposes to: 

P(G'i,...,G'„n/i,...,/,) = P(G'i,...,G'„|/i,...,/fc)P(/i,...,/,) (6) 

We now can estimate the prior probability of Gi, . . . , Gn in terms of the genotype com- 
bination sampling probability, P(Gi, . . . , Gn\fi, ■ ■ ■ , fk), and the probability of observing a 
given allele frequency in our population, P(/i, . . . , fk)- 

2.4.2 Genotype combination sampling probability P{Gi, . . . , Gn\fi, • • • , fk) 

The multinomial coefficient {f^'^^ fj gives the number of ways which a set of alleles with 
frequencies fi, . . . , fk may be distributed among M copies of a locus. For phased genotypes 
Gi the probability of sampling a specific Gi, . . . ,Gn given allele frequencies /i, . . . , is thus 
provided by the inverse of this term: 

/ M 

P(Gi,...,G„|/i,. ..,/,)= (7) 

• • • ) Ik/ 

However, our model is limited to unphased genotypes because our primary data only 
allows phasing within a limited context. Consequently, we must adjust ([7]) to refiect the 
number of phased genotypes which correspond to the unphased genotyping Gi, . . . , Gn- Each 
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unphased genotype corresponds to as many phased genotypes as there are permutations of 
the alleles in Gj. Thus, for a given unphased geno typing Gi, . . . , G„, there are Y\7-^ ( f ) 

phased genotypings. 

In conjunction, these two terms provide the probability of sampling a particular unphased 
genotype combination given a set of allele frequencies: 



(8) 

In the case of a fully diploid population, the product of all possible multiset permutations 
of all genotypes reduces to 2^, where h is the number of heterozygous genotypes, simplifying 
dHD to: 

M ^ 



P(Gi,...,G„,|/i,...,/fc) = 2M „ ^'^ „ ) (9) 

2.4.3 Derivation of P{fi, . . . , fk) by Ewens' sampling formula 

Provided our sample size n is small relative to the population which it samples, and the popu- 
lation is in equilibrium under mutation and genetic drift, the probabili ty of observiri g a given 
set of allele frequencies at a locus is given by Ewens' sampling formula |Ewensl . 1972] . Ewens' 



sampling formula is based on an infinite alleles coalescent model, and relates the probability 
of observing a given set of allele frequencies to the number of sampled chromosomes at the 
locus (M) and the population mutation rate 6. 

The application of Ewens' formula to our context is straightforward. Let be the 
number of alleles among bi,...,bk whose allele frequency within our set of samples is /. 
We can thus transform our set of frequencies fi, . . . , fk into a set of non-negative frequency 
counts ai, . . . , : ^^Li f(^f = many fi, . . . , fk can map to the same Oi, . . . , Cm, 

this transformation is not invertible, but it is unique from Oi, . . . , Qm to fi, . . . , fk- 

Having transformed a set of frequencies over alleles to a set of frequency counts over 
frequencies, we can now use Ewens' sampling formula to approximate P{fi, . . . , fk) given 6: 

Ml ^ 6°"^ 

In the bi-allelic case in which our set of samples has two alleles with frequencies /i and 
/a such that fi + f2 = M: 

M' 6 

P{af, = 1, af, = 1) = M-i.. , .ff (11) 

Uz=l + ^) •/^•/2 

While in the monomorphic case, where only a single allele is represented at this locus in 
our population, this term reduces to: 
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PiaM 



(M- 1)! 



(12) 



In this case, P(/i, . . . , fk) = 1 — when M = 2. This is sensible as 6 represents the 
population mutation rate, which can be estimated from the pairw i se het erozygosity rate of 
any two chromosomes in the population |Tajima , 1983 , Watterson . 19751 . 



3 Direct detection of phase from short-read sequencing 

By modeling multiallelic loci, this Bayesian statistical framework provides the foundation for 
the direct detection of longer, multi-base alleles from sequence alignments. In this section we 
describe our implementation of a haplotype-based variant detection method based on this 
model. 

Our method assembles haplotype observations over minimal, dynamically-determined, 
reference-relative windows which contain multiple segregating alleles. To be used in the 
analysis, haplotype observations must be derived from aligned reads which are anchored by 
reference-matching sequence at both ends of the detection window. These haplotype ob- 
servations have derived quality estimations which allow their incorporation into the general 
statistical model described in section [2l We then employ a gradient ascent method to de- 
termine the maximum a posteriori estimate of a mutual genotyping over all samples under 
analysis and establish an estimate of the probability that the loci is polymorphic. 



3.1 Parsing haplotype observations from sequencing data 

In order to establish a range of sequence in which multiple polymorphisms segregate in 
the population under analysis, it is necessary to first determine potentially polymorphic 
windows in order to bound the analysis. This determination is complicated by the fact that 
a strict windowing can inappropriately break clusters of alleles into multiple variant calls. 
We employ a dynamic windowing approach that is driven by the observation of multiple 
proximal reference-relative variations (SNPs and indels) in input alignments. 

Where reference-relative variations are separated by less than a configurable number of 
non-polymorphic bases in an aligned sequence trace, our method combines them into a single 
haplotype allele observation. Hi. The observational quahty of these haplotype alleles is given 
as m.m{qiWb[ G H^, Qi), or the minimum of the supporting read's mapping quality and the 
minimum base quality of the haplotype's component variant allele observations. 



3.2 Determining a window over which to assemble haplotype observations 

At each position in the reference, we collect allele observations derived from alignments as 
described in 13.11 To improve performance, we apply a set of input filters to exclude alleles 
from the analysis which are highly unlikely to be true. These filters require a minimum 
number of alternate observations and a minimum sum of base qualities in a single sample in 
order to incorporate a putative allele and its observations into the analysis. 
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We then determine a haplotype length over which to genotype samples by a bounded 
iterative process. We first determine the allele passing the input filters which is longest 
relative to the reference. For instance, a longer allele could be a multi-base indel or a 
composite haplotype observation flanked by SNPs. Then, we parse haplotype observations 
from all the alignments which fully overlap this window, finding the rightmost end of the 
longest haplotype allele which begins within the window. This rightmost position is used to 
update the haplotype window length, and a new set of haplotype observations are assembled 
from the reads fully overlapping the new window. This process repeats until the rightmost 
end of the window is not partially overlapped by any haplotype observations which pass the 
input filters. This method will converge given reads have finite length and the only reads 
which fully overlap the detection window are used in the analysis. 

3.3 Detection and genotyping of local haplotypes 

Once a window for analysis has been determined, we parse all fully-overlapping reads into 
haplotype observations which are anchored at the boundaries of the window. Given these sets 
of sequencing observations r^^, . . . , = Ri and data likelihoods P{Ri\Gi) for each sample 
and possible genotype derived from the putative alleles, we then determine the probability 
of polymorphism at the locus given the Bayesian model described in section [21 

To establish a maximum a posteriori estimate of the genotype for each sample, we employ 
a convergent gradient ascent approach to the posterior probability distribution over the 
mutual genotyping across all samples under our Bayesian model. This process begins at the 
genotyping across all samples Gi, . . . ,Gn where each sample's genotype is the maximum- 
likelihood genotype given the data likelihood P{Ri\Gi): 

Gi, . . . ,Gn = argmax P{Ri\Gi) (13) 

The posterior search then attempts to find a genotyping Gi,...,Gn in the local space 
of genotypings which has higher posterior probability under the model than this initial 
genotyping. In practice, this step is done by searching through all genotypings in which 
a single sample has up to the A^th best genotype when ranked by P{Ri\Gi), and is a 
small number (e.g. 2). This search starts with some set of genotypes Gi, . . . , Gn = {G} and 
attempts to find a genotyping {G}' such that: 

P{{Gy\R,, ...,Rn)> P{{G}\Ru ...,Rn) (14) 

{G}' is then used as a basis for the next update step. This search iterates until conver- 
gence, but in practice must be bounded at a fixed number of steps in order to ensure optimal 
performance. As the quality of input data increases in coverage and confidence, this search 
will converge more quickly because the maximum-likelihood estimate will lie closer to the 
maximum a posteriori estimate under the model. 

This method incorporates a basic form of genotype imputation into the detection method, 
which in practice improves the quality of raw genotypes produced in primary allele detection 
and genotyping relative to methods which only utilize a maximum-likelihood method to 
determine genotypes. Furthermore, this method allows for the determination of marginal 
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genotype likelihoods via the marginalization of assigned genotypes for each sample over the 
posterior probability distribution. 

3.4 Probability of polymorphism 

Provided a maximum a posteriori estimate of the genotyping of all the individuals in our 
sample, we might like establish an estimate of the quality of the genotyping. For this, we 
can use the probability that the locus is polymorphic, which means that the number of 
distinct alleles at the locus, K, is greater than 1. While in practice the space of possible 
genotypings is too great to integrate over, it is possible to derive the probability that the 
loci is polymorphic in our samples by summing across the monomorphic cases: 

P{K > l\Ri, ...,Rn) = l- P{K = ...,Rn) (15) 

Equation (fT5|) thus provides the probability of polymorphism at the site, which is provided 
as a quality estimate for each evaluated locus in the output of FreeBayes. 

3.5 Marginal likelihoods of individual genotypes 

Similarly, we can establish a quality estimate for a single genotype by summing over the 
marginal probability of that specific genotype and sample combination under the model. 
The marginal probability of a given genotype is thus: 

P{G,\R„...,R^)= Yl Pi{G}\Ri,...,Rn) (16) 

V({G}:G,e{G}) 

In implementation, the development of this term is more complex, as we must sample 
enough genotypings from the posterior in order to obtain well-normalized marginal likeli- 
hoods. In practice, we marginalize from the local space of genotypings in which each indi- 
vidual genotype is no more than a small number of steps in one sample from the maximum 
a posteriori estimate of Gj, . . . , G„. This space is similar to that used during the posterior 
search described in section [5751 We apply ( TTB]) to it to estimate marginal genotype likelihoods 
for the most likely individual genotypes, which are provided for each sample at each site in 
the output of our implementation. 
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